Question 1

a.)

False; a categorical variable with three levels needs only 2 dummy variables, not 3. If we used all 3 alongside an intercept, the third dummy column would be a linear combination of the intercept and the first two columns, making the design matrix rank deficient. This is known as the dummy variable trap.
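A quick sketch in R (with a hypothetical three-level factor) shows the rank deficiency: with an intercept, the three dummy columns sum to the intercept column, so one column is redundant.

region <- factor(c("A", "B", "C", "A", "B", "C"))
# Force all three dummies alongside an intercept column
X_trap <- cbind(1, sapply(levels(region), function(l) as.numeric(region == l)))
qr(X_trap)$rank  # 3, not 4: the columns are linearly dependent
# model.matrix(~ region) avoids the trap by dropping one level automatically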

b.)

False; we do not prefer such high-order models because they have high sampling variability and risk overfitting the data.

c.)

True; since the terms are multiplied, the scale of the interaction coefficient depends on the scales of the terms being multiplied.
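A small simulated example illustrates this: multiplying one term by 100 shrinks the fitted interaction coefficient by the same factor.

set.seed(1)
x1 <- rnorm(100); x2 <- rnorm(100)
y  <- 1 + x1 + x2 + 0.5 * x1 * x2 + rnorm(100, sd = 0.1)
coef(lm(y ~ x1 * x2))["x1:x2"]                    # close to 0.5
coef(lm(y ~ I(100 * x1) * x2))["I(100 * x1):x2"]  # close to 0.005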

d.)

False; separate models, while more flexible, are less efficient and have more sampling variability than a joint model, since each is fit on only a subset of the data.

Question 2

# Libraries
library(MASS)
library(ggplot2)
library(plotly)
library(GGally)

a.)

cars <- read.csv("Cars.csv")
cars

b.)

# Fixing horsepower column
cars$horsepower <- as.numeric(cars$horsepower)
## Warning: NAs introduced by coercion
# Find indices of rows with missing horsepower
# (which(is.na(cars)) would return cell positions, not row numbers)
NArows <- which(is.na(cars$horsepower))

# View rows
cars[NArows,]

c.)

str(cars)
## 'data.frame':    397 obs. of  7 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : num  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : int  3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ country.code: int  1 1 1 1 1 1 1 1 1 1 ...

We should treat the variables “cylinders” and “country.code” as categorical variables. The rest of the variables should be treated as quantitative variables.

cars$cylinders <- as.factor(cars$cylinders)
cars$country.code <- as.factor(cars$country.code)

d.)

# MPG
ggplot(data = cars, aes(x = mpg)) + geom_histogram(bins = 15)

#Displacement
ggplot(data = cars, aes(x = displacement)) + geom_histogram(bins = 10)

# Horsepower
ggplot(data = cars, aes(x = horsepower))+ geom_histogram(bins = 20)
## Warning: Removed 5 rows containing non-finite values (stat_bin).

# Weight
ggplot(data = cars, aes(x = weight)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Acceleration
ggplot(data = cars, aes(x = acceleration)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

From the histograms we notice the following: mpg is slightly right skewed, displacement is heavily right skewed, horsepower and weight are moderately right skewed, and acceleration is approximately normal.

e.)

ggpairs(cars[,c(1, 3, 4, 5, 6)], title = "Correlogram of Cars Data")

We can see from the scatter plot matrix that we face multicollinearity among the predictors in this data. The variables mpg, displacement, horsepower, and weight are all highly correlated with each other, the strongest being the correlation between displacement and weight at \(0.933\).
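Variance inflation factors make this concrete (a sketch; assumes the car package is installed):

library(car)
# VIFs well above 5-10 signal the multicollinearity seen in the correlogram
vif(lm(mpg ~ displacement + horsepower + weight, data = cars))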

f.)

fig1 <-
  plot_ly(
    data = data.frame(table(cars$cylinders)),
    labels = ~ Var1,
    values = ~ Freq,
    type = 'pie'
  )
fig1 <-
  fig1 %>% layout(
    title = "Number of Cylinders",
    xaxis = list(
      showgrid = FALSE,
      zeroline = FALSE,
      showticklabels = FALSE
    ),
    yaxis = list(
      showgrid = FALSE,
      zeroline = FALSE,
      showticklabels = FALSE
    )
  )
fig1
fig2 <-
  plot_ly(
    data = data.frame(table(cars$country.code)),
    labels = ~ Var1,
    values = ~ Freq,
    type = 'pie'
  )
fig2 <-
  fig2 %>% layout(
    title = "Pie Chart of Country Code",
    xaxis = list(
      showgrid = FALSE,
      zeroline = FALSE,
      showticklabels = FALSE
    ),
    yaxis = list(
      showgrid = FALSE,
      zeroline = FALSE,
      showticklabels = FALSE
    )
  )
fig2

g.)

ggplot(data = cars, aes(x = cylinders, y = mpg, fill = country.code)) + geom_boxplot()

We can see that the only cylinder options available in all three countries were 4 cylinders and 6 cylinders; some countries offered only one cylinder option. Four-cylinder cars had, on average, the best miles per gallon regardless of country, while eight-cylinder cars had the worst. Four-cylinder cars in country code 3 performed best of all categories on average.

Question 3

a.)

ggplot(data = cars, aes(x = mpg)) + geom_histogram(bins = 20)

cars.Model <- lm(mpg ~ cylinders + horsepower + weight, data = cars)

bc <- boxcox(cars.Model)

From the histogram and the boxcox plot, we can see that mpg is moderately right skewed; because of this, we will apply the transformation chosen by the Box-Cox procedure.

# Find best lambda
lambda <- bc$x[which.max(bc$y)]

# make new data column
cars$mpgTrans <- (cars$mpg^lambda - 1)/ lambda
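For later interpretation it helps to have the inverse transform on hand (a small helper, using the same lambda as above):

invBoxCox <- function(yt, lambda) {
  # exp() handles the lambda = 0 limit of the Box-Cox family
  if (abs(lambda) < 1e-8) exp(yt) else (lambda * yt + 1)^(1 / lambda)
}
# invBoxCox(cars$mpgTrans, lambda) recovers the original mpg values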

b.)

cars.Model.T <- lm(mpgTrans ~ cylinders + horsepower + weight, data = cars)
summary(cars.Model.T)
## 
## Call:
## lm(formula = mpgTrans ~ cylinders + horsepower + weight, data = cars)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.125927 -0.027564 -0.001233  0.031690  0.161618 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.018e+00  2.733e-02  73.833  < 2e-16 ***
## cylinders4   7.577e-02  2.274e-02   3.332 0.000945 ***
## cylinders5   1.049e-01  3.474e-02   3.018 0.002712 ** 
## cylinders6   3.811e-02  2.354e-02   1.619 0.106239    
## cylinders8   4.402e-02  2.521e-02   1.746 0.081652 .  
## horsepower  -8.508e-04  1.345e-04  -6.326 6.98e-10 ***
## weight      -6.122e-05  6.939e-06  -8.822  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.04474 on 385 degrees of freedom
##   (5 observations deleted due to missingness)
## Multiple R-squared:  0.8223, Adjusted R-squared:  0.8196 
## F-statistic:   297 on 6 and 385 DF,  p-value: < 2.2e-16

From the F test, we can see a very small p-value, indicating that at least one of the coefficients is significant. Both weight and horsepower are highly significant. The 4- and 5-cylinder dummy coefficients are significant; however, the 6- and 8-cylinder ones are not. The \(R^2\) and \(R^2_{adj}\) are 0.8223 and 0.8196 respectively, indicating a strong fit. The model is adequate.

c.)

\[ \hat{Y}_i = (2.018 + 0.07577) - 0.00085X_{hp} - 0.000061X_{wt} \]

d.)

cars.Model.Int <- lm(mpgTrans ~ cylinders + horsepower + weight + cylinders*horsepower + cylinders*weight, data = cars)
summary(cars.Model.Int)
## 
## Call:
## lm(formula = mpgTrans ~ cylinders + horsepower + weight + cylinders * 
##     horsepower + cylinders * weight, data = cars)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.12550 -0.02515  0.00000  0.02787  0.17560 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)
## (Intercept)           -3.418319  10.587892  -0.323    0.747
## cylinders4             5.521017  10.587914   0.521    0.602
## cylinders5             5.940183  10.594256   0.561    0.575
## cylinders6             5.429651  10.588039   0.513    0.608
## cylinders8             5.461130  10.587969   0.516    0.606
## horsepower             0.263444   0.562359   0.468    0.640
## weight                -0.008731   0.018858  -0.463    0.644
## cylinders4:horsepower -0.264935   0.562359  -0.471    0.638
## cylinders5:horsepower -0.268299   0.562362  -0.477    0.634
## cylinders6:horsepower -0.263218   0.562359  -0.468    0.640
## cylinders8:horsepower -0.264355   0.562359  -0.470    0.639
## cylinders4:weight      0.008688   0.018858   0.461    0.645
## cylinders5:weight      0.008648   0.018858   0.459    0.647
## cylinders6:weight      0.008650   0.018858   0.459    0.647
## cylinders8:weight      0.008677   0.018858   0.460    0.646
## 
## Residual standard error: 0.04373 on 377 degrees of freedom
##   (5 observations deleted due to missingness)
## Multiple R-squared:  0.8337, Adjusted R-squared:  0.8276 
## F-statistic:   135 on 14 and 377 DF,  p-value: < 2.2e-16

\[ \hat{Y}_i = (-3.418 + 5.521) + (0.263 - 0.265)X_{hp} + (-0.009 + 0.009)X_{wt} \]

e.)

anova(cars.Model.Int, cars.Model.T)

We can see from the anova() output that we would reject \(H_0\), concluding that the full model with interactions is a better fit.
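Concretely, anova() here carries out the partial F-test of \(H_0\): all interaction coefficients are zero, using

\[ F = \frac{(SSE_R - SSE_F)/(df_R - df_F)}{SSE_F / df_F}, \]

where \(R\) is the model without interactions (\(df_R = 385\)) and \(F\) is the model with them (\(df_F = 377\)).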

f.)

# For non interaction model
predict(cars.Model.T, data.frame(cylinders = "4", horsepower = 100, weight = 3000), interval = "prediction", level = 0.95)
##        fit      lwr      upr
## 1 1.824593 1.736006 1.913179
# With interaction model
predict(cars.Model.Int, data.frame(cylinders = "4", horsepower = 100, weight = 3000), interval = "prediction", level = 0.95)
##        fit     lwr      upr
## 1 1.822971 1.73587 1.910073

We find that the two models give almost identical predictions; note that both are on the Box-Cox-transformed scale, not in miles per gallon.
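To report these on the original mpg scale, the interval endpoints can be back-transformed with the inverse of the Box-Cox transform from 3(a) (same lambda as before):

pred <- predict(cars.Model.T,
                data.frame(cylinders = "4", horsepower = 100, weight = 3000),
                interval = "prediction", level = 0.95)
# The transform is monotone, so endpoints map directly to mpg units
(lambda * pred + 1)^(1 / lambda)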